Handling Class Imbalance Problem Using Feature Selection

Author

  • Deepika Tiwari
Abstract

1 Introduction

The class imbalance problem is a challenge for machine learning and data mining, and it has attracted significant research in recent years. A classifier affected by class imbalance on a given data set shows strong overall accuracy but very poor performance on the minority class. Imbalanced data sets are pervasive in real-world applications, including biological data analysis, text classification, image classification, and web page classification, among many others. The skew can be severe: some imbalanced data sets have only one minority sample for every 100 majority samples. Researchers have crafted many techniques to combat the class imbalance problem, including resampling, new algorithms, and feature selection.

With imbalanced data, classification rules that predict the small classes tend to be fewer and weaker than those that predict the prevalent classes; consequently, test samples belonging to the small classes are misclassified more often than those belonging to the prevalent classes. Standard classifiers usually perform poorly on imbalanced data sets because they are designed to generalize from training data and output the simplest hypothesis that best fits the data. The simplest hypothesis therefore pays less attention to rare cases. In many applications, however, identifying rare objects is of crucial importance, and classification performance on the small classes is the main concern in judging a classification model.

Why is the class imbalance problem so prevalent and difficult to overcome? First, modern classifiers assume that the unseen data points on which they will be asked to make predictions are drawn from the same distribution as the training data. If the testing and validation samples are drawn from a different distribution, the trained classifier may give poor results because the model is flawed. Based on this assumption, a classifier will almost always produce poor accuracy on an imbalanced data set.

In a bi-class application, the imbalance problem appears when one class is represented by a large number of samples while the other is represented by only a few. The class with very few training samples, which is usually associated with high identification importance, is referred to as the positive class; the other is the negative class. The learning objective for this kind of data is to obtain satisfactory identification performance on the positive (small) class. Researchers have crafted many techniques to combat the class imbalance problem, …
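
The gap between overall accuracy and minority-class performance described above can be illustrated with a small sketch. This is not the author's method; it is a hypothetical baseline that always predicts the majority class, applied to a made-up label set with one positive sample per 100 negatives. The function names `majority_baseline`, `accuracy`, and `recall` are illustrative assumptions.

```python
def majority_baseline(labels):
    """Predict the most frequent label for every sample (illustrative baseline)."""
    majority = max(set(labels), key=labels.count)
    return [majority] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Made-up labels: 10 positive (minority) and 1000 negative (majority) samples,
# i.e. one minority sample for every 100 majority samples.
y_true = [1] * 10 + [0] * 1000
y_pred = majority_baseline(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(f"accuracy        = {acc:.3f}")  # 1000 / 1010, about 0.990
print(f"minority recall = {rec:.3f}")  # 0.000, no positive sample is found
```

Under these assumptions the baseline is about 99% accurate while identifying none of the positive samples, which is why accuracy alone is a poor measure when the learning objective is the positive (small) class.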

Similar resources

A Novel One Sided Feature Selection Method for Imbalanced Text Classification

Imbalanced data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend to favor the large class and might even treat the minority class data as outliers. The text data is one of t...

Extracting Predictor Variables to Construct Breast Cancer Survivability Model with Class Imbalance Problem

Application of data mining methods as a decision support system has a great benefit to predict survival of new patients. It also has a great potential for health researchers to investigate the relationship between risk factors and cancer survival. But due to the imbalanced nature of datasets associated with breast cancer survival, the accuracy of survival prognosis models is a challenging issue...

Semi Supervised Under-sampling: a Solution to the Class Imbalance Problem for Classification and Feature Selection

Most medical datasets are not balanced in their class labels. Furthermore, in some cases it has been noticed that the given class labels do not accurately represent characteristics of the data record. Most existing classification methods tend not to perform well on minority class examples when the dataset is extremely imbalanced. This is because they aim to optimize the overall accuracy without...

A Survey on Feature Selection Methods for Imbalanced Datasets

The class imbalance problem is one of the greatest challenges in machine learning and data mining research, and it has attracted significant interest from academics, industry, and research teams in recent years. Researchers have proposed many techniques to handle the class imbalance problem, including resampling, new algorithms, and feature selection. The class imbalance problem is even m...

Improving telemarketing Intelligence through Significant proportion of target Instances

In this paper we propose, develop, and test a new single-feature evaluator called Significant Proportion of Target Instances (SPTI) to handle the direct-marketing data with the class imbalance problem. The SPTI feature evaluator demonstrates its stability and outstanding performance through empirical experiments in which the real-world customer data of an e-recruitment firm are used. This resea...

Publication date: 2014